Clustering Categorical Data Based on Combinations of Attribute Values

Authors

  • Hee-Jung Do
  • Jae Yearn Kim
Abstract

Clustering is an important technique for exploratory data analysis. While most earlier clustering algorithms focused on numerical data, real-world problems and data mining applications frequently involve categorical data. Here, we propose a new clustering algorithm for categorical data, FAVC, that is based on the frequency of attribute value combinations. For each object, the algorithm enumerates all combinations of its attribute values, that is, every subset of the attribute values the object contains, and then assigns the object to a cluster using the frequency of these combinations in each cluster. Because the algorithm considers all subsets of an object's attribute values, objects in the same cluster have not only similar attribute value sets but also strongly associated attribute values. Moreover, rather than relying on the pairwise similarity between two objects, the proposed algorithm uses the similarity between an object and each cluster, so the clustering results reflect global information. We conducted experiments with real and synthetic data sets to evaluate FAVC. We show that FAVC is more scalable and provides higher quality results than the previous method.
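The abstract does not give FAVC's scoring function or assignment rule, so the following Python sketch only illustrates the general idea it describes: enumerate every combination of an object's attribute values, then compare the object with each cluster through the frequency of those combinations. The names `value_combinations`, `Cluster`, `assign`, the mean-relative-frequency score, and the `threshold` parameter are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from itertools import combinations


def value_combinations(obj):
    """Return every non-empty combination of (attribute, value) pairs in obj."""
    items = sorted(obj.items())
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


class Cluster:
    """Keeps a count of how often each value combination occurs among members."""

    def __init__(self):
        self.size = 0
        self.combo_counts = Counter()

    def add(self, obj):
        self.size += 1
        self.combo_counts.update(value_combinations(obj))

    def score(self, obj):
        # Assumed similarity (not from the paper): the mean relative
        # frequency of the object's combinations within this cluster.
        if self.size == 0:
            return 0.0
        combos = value_combinations(obj)
        total = sum(self.combo_counts[c] for c in combos)
        return total / (len(combos) * self.size)


def assign(objects, threshold=0.3):
    """Single pass: join the best-scoring cluster or open a new one.

    The threshold rule is a hypothetical choice for this sketch.
    """
    clusters, labels = [], []
    for obj in objects:
        scores = [c.score(obj) for c in clusters]
        best = max(range(len(clusters)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            clusters.append(Cluster())
            best = len(clusters) - 1
        clusters[best].add(obj)
        labels.append(best)
    return labels, clusters


data = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "blue", "shape": "square"},
]
labels, clusters = assign(data)
```

An object with m attributes has 2^m - 1 non-empty combinations, so a sketch like this is practical only for a small number of attributes; how FAVC itself controls this cost is not stated in the abstract.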

Similar resources

Incremental Algorithm to Cluster the Categorical Data with Frequency Based Similarity Measure

Clustering categorical data is more complicated than clustering numerical data because of its special properties. Scalability and memory constraints are challenging problems when clustering large data sets. This paper presents an incremental algorithm to cluster categorical data. The frequencies of attribute values contribute much to clustering similar categorical objects. In this paper we propose...


Improved K-Modes for Categorical Clustering Using Weighted Dissimilarity Measure

K-Modes is an extension of the K-Means clustering algorithm, developed to cluster categorical data, in which the mean is replaced by the mode. The similarity measure proposed by Huang is the simple matching (or mismatching) measure. The weights of attribute values contribute much to clustering; thus in this paper we propose a new weighted dissimilarity measure for K-Modes, based on the ratio of frequency...


Context-Based Distance Learning for Categorical Data Clustering

Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of the same categorical attribute, since they are not ordered. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition of this work is that the d...


An approach to deal with time- evolving Categorical data based on NIR using Clustering

Data clustering is an important technique for exploratory data analysis and has been the focus of substantial research in several domains for decades. Within this research, sampling has been recognized as an important technique to improve the efficiency of clustering. However, with sampling applied, those points that are not sampled will not have their labels after the normal process. Although there is a ...


Improving K-Modes Algorithm Considering Frequencies of Attribute Values in Mode

The original k-means algorithm is designed to work primarily on numeric data sets. This prohibits the algorithm from being applied to categorical data clustering, which is an integral part of data mining and has attracted much attention recently. The k-modes algorithm extended the k-means paradigm to cluster categorical data by using a frequency-based method to update the cluster modes versus t...




Publication date: 2009